Biostat 212a Homework 2
Due Feb 8, 2025 @ 11:59PM
1 ISL Exercise 4.8.1 (10pts)
Solution:
2 ISL Exercise 4.8.6 (10pts)
Solution:
3 ISL Exercise 4.8.9 (10pts)
Solution:
4 ISL Exercise 4.8.13 (a)-(i) (50pts)
Solution:
library(ISLR2)
library(MASS)
data("Weekly")
# Structure of the dataset
str(Weekly)'data.frame': 1089 obs. of 9 variables:
$ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
$ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
$ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
$ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
$ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
$ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
$ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
$ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
$ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
(a)
# Numerical summary
summary(Weekly) Year Lag1 Lag2 Lag3
Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
Lag4 Lag5 Volume Today
Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
Direction
Down:484
Up :605
# Plot the Volume over time
plot(Weekly$Year, Weekly$Volume, main="Trading Volume Over Time", xlab="Year", ylab="Volume", col="blue", pch=20)# Boxplot of market return (Today) by Direction
boxplot(Today ~ Direction, data=Weekly, main="Market Return by Direction", ylab="Today’s Return", col=c("red", "green"))# Correlation matrix (excluding categorical variables)
cor(Weekly[, -9]) # Exclude the Direction column Year Lag1 Lag2 Lag3 Lag4
Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
Lag5 Volume Today
Year -0.030519101 0.84194162 -0.032459894
Lag1 -0.008183096 -0.06495131 -0.075031842
Lag2 -0.072499482 -0.08551314 0.059166717
Lag3 0.060657175 -0.06928771 -0.071243639
Lag4 -0.075675027 -0.06107462 -0.007825873
Lag5 1.000000000 -0.05851741 0.011012698
Volume -0.058517414 1.00000000 -0.033077783
Today 0.011012698 -0.03307778 1.000000000
Volume has increased significantly over time. Returns (Today) have a mean near zero, indicating relatively balanced ups and downs in market movement. The Direction variable (Up/Down) suggests a roughly even split, meaning the market is not strongly biased in one direction.
Upward (Up) and downward (Down) movements have different distributions. Down days have more extreme negative outliers, indicating higher risk when the market declines. The median return is slightly higher for “Up” movements, but variance is similar for both.
(b)
# Fit logistic regression model
logistic_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
# Summary of the logistic regression model
summary(logistic_model)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Weekly)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
\(log(\frac{P(Up)}{1-P(Up)})\) = 0.26686−0.04127 × Lag1+0.05844 × Lag2−0.01066 × Lag3−0.02779 × Lag4−0.01447 × Lag5−0.02274 × Volume.
Yes, the p-values of Lag1 and Lag3 are less than 0.05, so they are statistically significant.
(c)
# Predict probabilities
pred_probs <- predict(logistic_model, type="response")
# Convert probabilities to class predictions (threshold = 0.5)
pred_classes <- ifelse(pred_probs > 0.5, "Up", "Down")
# Create confusion matrix
conf_matrix <- table(Predicted = pred_classes, Actual = Weekly$Direction)
# Compute accuracy
accuracy <- mean(pred_classes == Weekly$Direction)
# Print results
print(conf_matrix) Actual
Predicted Down Up
Down 54 48
Up 430 557
print(paste("Overall accuracy:", round(accuracy, 4)))[1] "Overall accuracy: 0.5611"
True Positives (TP) = 557 ; False Positives (FP) = 430 ; True Negatives (TN) = 54 ; False Negatives (FN) = 48
Accuracy= \(\frac{TP+TN}{Total Samples} = \frac{557+54}{54+48+430+557}=0.5611\) . This is only slightly better than random guessing (50%).
The model is biased towards predicting “Up”, as indicated by the large number of false positives (FP = 430). The model fails to predict “Down” accurately, with only 54 correct “Down” predictions out of 484 actual “Down” instances.
(d)
# Split the dataset
train <- Weekly$Year < 2009
train_data <- Weekly[train, ]
test_data <- Weekly[!train, ]
# Fit logistic regression using Lag2
logistic_model_lag2 <- glm(Direction ~ Lag2, data=train_data, family=binomial)
# Predict on test data
test_probs <- predict(logistic_model_lag2, newdata=test_data, type="response")
# Convert probabilities to class labels
test_preds <- ifelse(test_probs > 0.5, "Up", "Down")
# Compute confusion matrix
conf_matrix_test <- table(Predicted = test_preds, Actual = test_data$Direction)
# Compute accuracy
test_accuracy <- mean(test_preds == test_data$Direction)
# Print results
print(conf_matrix_test) Actual
Predicted Down Up
Down 9 5
Up 34 56
print(paste("Test accuracy:", round(test_accuracy, 4)))[1] "Test accuracy: 0.625"
(e)
# Fit LDA model
lda_model <- lda(Direction ~ Lag2, data=train_data)
# Predict on test data
lda_preds <- predict(lda_model, newdata=test_data)
# Extract class predictions
lda_classes <- lda_preds$class
# Create confusion matrix
conf_matrix_lda <- table(Predicted = lda_classes, Actual = test_data$Direction)
# Compute accuracy
lda_accuracy <- mean(lda_classes == test_data$Direction)
# Print results
print(conf_matrix_lda) Actual
Predicted Down Up
Down 9 5
Up 34 56
print(paste("LDA test accuracy:", round(lda_accuracy, 4)))[1] "LDA test accuracy: 0.625"
(f)
# Fit QDA model
qda_model <- qda(Direction ~ Lag2, data=train_data)
# Predict on test data
qda_preds <- predict(qda_model, newdata=test_data)
# Extract class predictions
qda_classes <- qda_preds$class
# Compute confusion matrix
conf_matrix_qda <- table(Predicted = qda_classes, Actual = test_data$Direction)
# Compute accuracy
qda_accuracy <- mean(qda_classes == test_data$Direction)
# Print results
print(conf_matrix_qda) Actual
Predicted Down Up
Down 0 0
Up 43 61
print(paste("QDA test accuracy:", round(qda_accuracy, 4)))[1] "QDA test accuracy: 0.5865"
(g)
library(class)
# Prepare training and test data
train_X <- train_data$Lag2
test_X <- test_data$Lag2
train_Y <- train_data$Direction
test_Y <- test_data$Direction
# Apply KNN with K=1
knn_preds <- knn(train = matrix(train_X), test = matrix(test_X),
cl = train_Y, k = 1)
# Compute confusion matrix
conf_matrix_knn <- table(Predicted = knn_preds, Actual = test_Y)
# Compute accuracy
knn_accuracy <- mean(knn_preds == test_Y)
# Print results
print(conf_matrix_knn) Actual
Predicted Down Up
Down 21 30
Up 22 31
print(paste("KNN (K=1) test accuracy:", round(knn_accuracy, 4)))[1] "KNN (K=1) test accuracy: 0.5"
(h)
library(e1071)
# Fit Naive Bayes model
nb_model <- naiveBayes(Direction ~ Lag2, data=train_data)
# Predict on test data
nb_preds <- predict(nb_model, newdata=test_data)
# Compute confusion matrix
conf_matrix_nb <- table(Predicted = nb_preds, Actual = test_data$Direction)
# Compute accuracy
nb_accuracy <- mean(nb_preds == test_data$Direction)
# Print results
print(conf_matrix_nb) Actual
Predicted Down Up
Down 0 0
Up 43 61
print(paste("Naive Bayes test accuracy:", round(nb_accuracy, 4)))[1] "Naive Bayes test accuracy: 0.5865"
(i)
# Create a comparison table
model_comparison <- data.frame(
Model = c("Logistic Regression", "LDA", "QDA", "KNN (K=1)",
"Naive Bayes"),
Accuracy = c(test_accuracy, lda_accuracy, qda_accuracy,
knn_accuracy, nb_accuracy)
)
# Print comparison results
print(model_comparison) Model Accuracy
1 Logistic Regression 0.6250000
2 LDA 0.6250000
3 QDA 0.5865385
4 KNN (K=1) 0.5000000
5 Naive Bayes 0.5865385
The Logistic Regression and LDA appear to have the best results on this data, and they both have 0.625 accuracy.
5 Bonus question: ISL Exercise 4.8.13 Part (j) (30pts)
Solution:
(j) Logistic Regression with multiple predictors
logistic_model_extended <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data=train_data, family=binomial)
# Predictions
test_probs_extended <- predict(logistic_model_extended, newdata=test_data, type="response")
test_preds_extended <- ifelse(test_probs_extended > 0.5, "Up", "Down")
# Confusion Matrix
conf_matrix_logistic_extended <- table(Predicted = test_preds_extended, Actual = test_data$Direction)
logistic_accuracy_extended <- mean(test_preds_extended == test_data$Direction)
print(conf_matrix_logistic_extended) Actual
Predicted Down Up
Down 31 44
Up 12 17
print(paste("Extended Logistic Regression Accuracy:", round(logistic_accuracy_extended, 4)))[1] "Extended Logistic Regression Accuracy: 0.4615"
Logistic Regression with interaction terms
logistic_model_interaction <- glm(Direction ~ Lag2 * Volume,
data=train_data, family=binomial)
# Predictions
test_probs_interaction <- predict(logistic_model_interaction,
newdata=test_data, type="response")
test_preds_interaction <- ifelse(test_probs_interaction > 0.5, "Up", "Down")
# Confusion Matrix
conf_matrix_logistic_interaction <- table(Predicted = test_preds_interaction,
Actual = test_data$Direction)
logistic_accuracy_interaction <- mean(
test_preds_interaction == test_data$Direction
)
print(conf_matrix_logistic_interaction) Actual
Predicted Down Up
Down 20 25
Up 23 36
print(paste("Logistic Regression with Interaction Accuracy:",
round(logistic_accuracy_interaction, 4)))[1] "Logistic Regression with Interaction Accuracy: 0.5385"
LDA with More Predictors
library(MASS)
lda_model_extended <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data=train_data)
# Predictions
lda_preds_extended <- predict(lda_model_extended, newdata=test_data)$class
# Confusion Matrix
conf_matrix_lda_extended <- table(Predicted = lda_preds_extended, Actual =
test_data$Direction)
lda_accuracy_extended <- mean(lda_preds_extended == test_data$Direction)
print(conf_matrix_lda_extended) Actual
Predicted Down Up
Down 31 44
Up 12 17
print(paste("Extended LDA Accuracy:", round(lda_accuracy_extended, 4)))[1] "Extended LDA Accuracy: 0.4615"
QDA with More Predictors
qda_model_extended <- qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data=train_data)
# Predictions
qda_preds_extended <- predict(qda_model_extended, newdata=test_data)$class
# Confusion Matrix
conf_matrix_qda_extended <- table(Predicted = qda_preds_extended, Actual =
test_data$Direction)
qda_accuracy_extended <- mean(qda_preds_extended == test_data$Direction)
print(conf_matrix_qda_extended) Actual
Predicted Down Up
Down 33 49
Up 10 12
print(paste("Extended QDA Accuracy:", round(qda_accuracy_extended, 4)))[1] "Extended QDA Accuracy: 0.4327"
Tuning K for KNN
library(class)
# Function to evaluate KNN for different K values
knn_evaluate <- function(k) {
knn_preds <- knn(train=as.matrix(train_data[, c("Lag2")]),
test=as.matrix(test_data[, c("Lag2")]),
cl=train_data$Direction, k=k)
conf_matrix_knn <- table(Predicted = knn_preds, Actual = test_data$Direction)
knn_accuracy <- mean(knn_preds == test_data$Direction)
return(list(conf_matrix=conf_matrix_knn, accuracy=knn_accuracy))
}
# Experiment with different values of K
knn_results <- lapply(c(1, 3, 5, 7, 10, 15, 20), knn_evaluate)
# Print results for each K
for (i in 1:length(knn_results)) {
print(paste("KNN with K =", c(1, 3, 5, 7, 10, 15, 20)[i]))
print(knn_results[[i]]$conf_matrix)
print(paste("Accuracy:", round(knn_results[[i]]$accuracy, 4)))
}[1] "KNN with K = 1"
Actual
Predicted Down Up
Down 21 30
Up 22 31
[1] "Accuracy: 0.5"
[1] "KNN with K = 3"
Actual
Predicted Down Up
Down 16 19
Up 27 42
[1] "Accuracy: 0.5577"
[1] "KNN with K = 5"
Actual
Predicted Down Up
Down 16 21
Up 27 40
[1] "Accuracy: 0.5385"
[1] "KNN with K = 7"
Actual
Predicted Down Up
Down 15 20
Up 28 41
[1] "Accuracy: 0.5385"
[1] "KNN with K = 10"
Actual
Predicted Down Up
Down 18 18
Up 25 43
[1] "Accuracy: 0.5865"
[1] "KNN with K = 15"
Actual
Predicted Down Up
Down 20 20
Up 23 41
[1] "Accuracy: 0.5865"
[1] "KNN with K = 20"
Actual
Predicted Down Up
Down 21 19
Up 22 42
[1] "Accuracy: 0.6058"
Naive Bayes with More Predictors
library(e1071)
nb_model_extended <- naiveBayes(Direction ~ Lag1 + Lag2 +
Lag3 + Lag4 + Lag5 + Volume, data=train_data)
# Predictions
nb_preds_extended <- predict(nb_model_extended, newdata=test_data)
# Confusion Matrix
conf_matrix_nb_extended <- table(Predicted = nb_preds_extended, Actual =
test_data$Direction)
nb_accuracy_extended <- mean(nb_preds_extended == test_data$Direction)
print(conf_matrix_nb_extended) Actual
Predicted Down Up
Down 42 56
Up 1 5
print(paste("Extended Naive Bayes Accuracy:", round(nb_accuracy_extended, 4)))[1] "Extended Naive Bayes Accuracy: 0.4519"
Comparing All Models
# Create a comparison table
model_comparison <- data.frame(
Model = c("Logistic Regression", "logistic_model_extended",
"Logistic Regression (Interaction)",
"LDA", "LDA (Extended)", "QDA", "QDA (Extended)",
"KNN (K=1)","KNN (K=3)", "KNN (K=5)", "KNN (K=7)",
"KNN (K=10)", "KNN (K=15)", "KNN (K=20)",
"Naive Bayes", "Naive Bayes (Extended)"),
Accuracy = c(test_accuracy, logistic_accuracy_extended,
logistic_accuracy_interaction,
lda_accuracy, lda_accuracy_extended,
qda_accuracy, qda_accuracy_extended,
knn_results[[1]]$accuracy, knn_results[[2]]$accuracy,
knn_results[[3]]$accuracy, knn_results[[4]]$accuracy,
knn_results[[5]]$accuracy, knn_results[[6]]$accuracy,
knn_results[[7]]$accuracy,
nb_accuracy, nb_accuracy_extended)
)
# Print comparison results
print(model_comparison) Model Accuracy
1 Logistic Regression 0.6250000
2 logistic_model_extended 0.4615385
3 Logistic Regression (Interaction) 0.5384615
4 LDA 0.6250000
5 LDA (Extended) 0.4615385
6 QDA 0.5865385
7 QDA (Extended) 0.4326923
8 KNN (K=1) 0.5000000
9 KNN (K=3) 0.5576923
10 KNN (K=5) 0.5384615
11 KNN (K=7) 0.5384615
12 KNN (K=10) 0.5865385
13 KNN (K=15) 0.5865385
14 KNN (K=20) 0.6057692
15 Naive Bayes 0.5865385
16 Naive Bayes (Extended) 0.4519231
6 Bonus question: ISL Exercise 4.8.4 (30pts)
Solution:
(a) Since the feature \(X\) is uniformly distributed in the range\([0,1]\), We consider a test observation and use only those within 10% of its range.
If a test point is at \(𝑋 = 0.6\), we use observations in the range: \([ 0.55 , 0.65]\). The total range is 1, so the fraction of data used is\(\frac{selected\ range}{total\ range} = \frac{0.65 − 0.55}{1} = 0.1\). Therefore, we use 10% of the data.
(b) We now have two features, \((X_1,X_2)\), both uniformly distributed on \([ 0 , 1] × [ 0 , 1]\).
To predict the response, we use only observations within 10% of both \(X_1\) and \(X_2\).
When \(X_1=0.6,\ X_2=0.35\), we will use: \(X_1\in[0.55,0.65] ,\ X_2 \in [0.3,0.4]\).
The fraction of data used is \(\frac{selected\ range}{total\ range} = \frac{0.10 * 0.10 }{1} = 0.01\). Therefore, we use 1% of the data.
(c)
Now, we have 100 features, all uniformly distributed in \([0,1]\). We use only the observations within 10% of each feature’s range.
For each feature \(X_i\), we select observations within: \([X_i−0.05,\ X_i+0.05]\) (assuming the test observation is not too close to the boundaries).
Since the features are independent, the fraction of observations in each dimension is 0.1. The fraction of data used is \((0.1)^{100}\), which is very small number.
Therefore, in 100D space, almost none of the training data is “close” to the test point, making KNN ineffective.
(d)
From previous results, in 1D, we use 10% of the data; In 2D, we use 1%; In 100D, we use \((0.1)^{100}\), which is practically zero.
Because most points are far away, and distances between them are almost uniform, and even with millions of points, find “near” neighbors is unlikely. What’s more, searching for neighbors in high-dimensional space is inefficient.
Therefore, KNN performs poorly in high dimensions because there are too few nearby training points, making predictions unreliable.
(e)
We define a p-dimensional hypercube centered at a test point that contains 10% of the total training data.
Volume of the hypercube in \(p-dimensional\) space is: \(Fraction\ of\ Data\ Used = ( Side\ Length )^𝑝\)
Let \(s\) be the side length of the hypercube. To contain 10% of data, we solve: \(s^p = 0.1\).
For \(𝑝= 1\): \(s^1 = 0.1\), \(s=0.1\). In 1D, we take 10% of the range.
For \(𝑝= 2\): \(s^2 = 0.1\), \(s=\sqrt{0.1} \approx 0.316\). In 2D, the square has side length around 0.316, much larger than in 1D.
For \(𝑝= 100\): \(s^{100} = 0.1\), \(s=0.1^{\frac{1}{100}} = 10^{-0.01} \approx 0.977\). In 100D, the side length is about 0.977, meaning almost the entire space is included.
In high dimensions, the hypercube must be almost the entire space to contain just 10% of the data. This confirms why KNN is ineffective in high dimensions: Everything is “far” apart, so the notion of “nearest neighbors” breaks down.
Thus, KNN works well in low dimensions because neighbors are meaningful. In high dimensions, the space becomes too sparse, making KNN ineffective.